House prices report

This document is a data science report on the Kaggle House Prices tutorial project. It was generated using the Shapash library.

General Information

Version : 0.7

Name : House Prices Prediction Project

Purpose : Predicting the sale price of houses

Date : 2021-11-11

Contributors : Yann Golhen, Sebastien Bidault, Thomas Bouche, Guillaume Vignal, Thibaud Real

Description : This data science project predicts the sale price of houses from 79 explanatory variables. It was designed within the data science team at X. and has been improved since the project began in 2019. The model has been in production since February 2021.

Git Commit : 1ff46e83beafba8949a7f3b7de27586acd6ae99e


Dataset Information

Origin : The Assessor’s Office

Description : the sale of individual residential property in Ames, Iowa

Depth : from 2006 to 2010

Perimeter : only residential sales

Target Variable : SalePrice

Target Description : The property's sale price in dollars


Data Preparation

Variable Filtering : All variables that required special knowledge or previous calculations for their use were removed

Individual Filtering : only the most recent sales data on any property were kept (for houses that were sold multiple times during this period)

Missing Values : were replaced by 0

Feature Engineering : No features were created. All features are taken directly from the Kaggle dataset. Categorical features were transformed using an ordinal encoder.
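The preparation steps above can be sketched as follows. This is a minimal illustration on a made-up frame, not the project's code: the `PropertyId` column, the `prepare` helper, and the choice to detect categorical columns by dtype are assumptions of this sketch. Only `YrSold`, `MoSold`, `LotArea` and `Street` are column names from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder


def prepare(df: pd.DataFrame, id_col: str = "PropertyId") -> pd.DataFrame:
    """Keep the latest sale per property, fill gaps with 0, ordinal-encode categoricals."""
    # Individual filtering: sort chronologically, keep the last sale of each property.
    # `id_col` is a hypothetical property identifier.
    df = df.sort_values(["YrSold", "MoSold"]).drop_duplicates(id_col, keep="last")
    df = df.reset_index(drop=True)

    # Assumption: every non-numeric column is treated as categorical.
    cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    num_cols = [c for c in df.columns if c not in cat_cols]

    # Missing values: numeric gaps become 0; categorical gaps become the string "0"
    # so that each categorical column keeps a single, uniform type.
    df[num_cols] = df[num_cols].fillna(0)
    df[cat_cols] = df[cat_cols].fillna("0")

    # Ordinal encoding: each category is mapped to an integer code (sorted order).
    encoded = OrdinalEncoder().fit_transform(df[cat_cols])
    for i, col in enumerate(cat_cols):
        df[col] = encoded[:, i]
    return df


# Toy data: property 1 was sold twice, so only its 2009 sale should survive.
raw = pd.DataFrame({
    "PropertyId": [1, 1, 2, 3],
    "YrSold": [2006, 2009, 2008, 2010],
    "MoSold": [5, 7, 2, 11],
    "LotArea": [8450.0, 8450.0, None, 11250.0],
    "Street": ["Pave", "Pave", "Grvl", None],
})
clean = prepare(raw)
```

Filling categorical gaps with a placeholder before encoding means "missing" receives its own integer code, which is one possible reading of "replaced by 0"; the report does not say in which order the two steps were applied.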


Model Training

Used Algorithm : We used scikit-learn's RandomForestRegressor. This model could be benchmarked against other models such as XGBRegressor or neural networks.

Parameters Choice : We did not perform any hyperparameter optimisation and chose n_estimators=50. Future work should include a grid search over the hyperparameters.

Metrics : Mean Squared Error

Validation Strategy : We split our data into train (75%) and test (25%) sets
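The training setup described above can be sketched as follows, on synthetic data standing in for the prepared features. The fixed `random_state` values are an assumption added for reproducibility; the parameter table in this report shows `random_state` as None.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data: 200 rows, 5 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 75% train / 25% test split, as in the validation strategy above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0
)

# No hyperparameter search: only n_estimators is set, everything else is default.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Mean squared error on the held-out 25%.
mse = mean_squared_error(y_test, model.predict(X_test))
```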


Model information

Model used : RandomForestRegressor

Library : sklearn.ensemble._forest

Library version : 0.24.1

Model parameters :

Parameter key Parameter value
base_estimator DecisionTreeRegressor()
n_estimators 50
estimator_params ('criterion', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'min_weight_fraction_leaf', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'random_state', 'ccp_alpha')
bootstrap True
oob_score False
n_jobs None
random_state None
verbose 0
warm_start False
class_weight None
max_samples None
criterion mse
max_depth None
min_samples_split 2
min_samples_leaf 1
min_weight_fraction_leaf 0.0
max_features auto
max_leaf_nodes None
min_impurity_decrease 0.0
min_impurity_split None
ccp_alpha 0.0
n_features_in_ 72
n_features_ 72
n_outputs_ 1
base_estimator_ DecisionTreeRegressor()
estimators_ [DecisionTreeRegressor(max_features='auto', random_state=794328667), DecisionTreeRegressor(max_features='auto', random_state=829974529), DecisionTreeRegressor(max_features='auto', random_state=162270107), DecisionTreeRegressor(max_features='auto', random_state=1992995699),...

Dataset analysis

Global analysis

Training dataset Prediction dataset
number of features 72 72
number of observations 1,095 365
missing values 0 0
% missing values 0 0

Univariate analysis

1stFlrSF - Numeric

First Floor square feet
Training dataset Prediction dataset
count 1,095 365
mean 1,180 1,120
std 400 341
min 334 483
25% 886 864
50% 1,100 1,050
75% 1,420 1,320
max 4,690 2,630
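A side-by-side summary of this kind can be reproduced with pandas. The sketch below is illustrative only: the `compare_describe` helper is hypothetical, and the five toy values per column were chosen to echo the min, quartile and max rows of the table, not taken from the actual data.

```python
import pandas as pd


def compare_describe(train: pd.Series, pred: pd.Series) -> pd.DataFrame:
    """Put the describe() output of two samples of one feature side by side."""
    return pd.concat(
        [train.describe(), pred.describe()],
        axis=1,
        keys=["Training dataset", "Prediction dataset"],
    )


# Toy samples (not the real data).
train = pd.Series([334, 886, 1100, 1420, 4690], name="1stFlrSF")
pred = pd.Series([483, 864, 1050, 1320, 2630], name="1stFlrSF")
summary = compare_describe(train, pred)
```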

Target analysis

SalePrice - Numeric

Training dataset Prediction dataset
count 1,095 365
mean 182,000 177,000
std 78,500 82,000
min 34,900 40,000
25% 130,000 126,000
50% 165,000 160,000
75% 215,000 205,000
max 755,000 745,000

Multivariate analysis


Model explainability

Note : the explainability graphs were generated using the test set only.

Global feature importance plot

Features contribution plots

1stFlrSF

First Floor square feet

Model performance - Validation Set

Univariate analysis of target variable

SalePrice - Numeric

True values Prediction values
count 365 365
mean 177,000 177,000
std 82,000 69,300
min 40,000 63,800
25% 126,000 128,000
50% 160,000 159,000
75% 205,000 198,000
max 745,000 536,000

Metrics

Mean absolute error : 16,600

Mean squared error : 632,000,000
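The two metrics are in different units: the mean absolute error is in dollars, while the mean squared error is in squared dollars. Taking the square root of the reported MSE gives a root mean squared error of roughly 25,100 dollars, which is directly comparable to the MAE. The sketch below shows the computation with scikit-learn on four made-up prices (the `y_true` and `y_pred` values are illustrative, not from the test set).

```python
import math

from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up sale prices and predictions, in dollars.
y_true = [200_000, 150_000, 320_000, 95_000]
y_pred = [210_000, 140_000, 300_000, 100_000]

mae = mean_absolute_error(y_true, y_pred)  # average absolute gap, in dollars
mse = mean_squared_error(y_true, y_pred)   # average squared gap, in dollars squared
rmse = math.sqrt(mse)                      # back to dollars

# Square root of the MSE reported above (632,000,000).
reported_rmse = math.sqrt(632_000_000)
```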


The graph below represents y_pred vs y_test :

CONFUSION MATRIX

You can add as many graphs, text, or other cells as you want.

The code will not be displayed. Only the markdown and output of the cells will be shown on the generated html file.

Stability Index

The Stability Index for  1stFlrSF :  0.054006641424508166
The Stability Index for  2ndFlrSF :  0.034436569164320176
The Stability Index for  3SsnPorch :  0.01719784169035784
The Stability Index for  BedroomAbvGr :  0.013789862414340312
The Stability Index for  BldgType :  0.3824328672343406
The Stability Index for  BsmtCond :  5.549098486292586
The Stability Index for  BsmtExposure :  3.540575124929754
The Stability Index for  BsmtFinSF1 :  0.04722680551466319
The Stability Index for  BsmtFinSF2 :  0.08937360519076179
The Stability Index for  BsmtFinType1 :  1.7406433687743188
The Stability Index for  BsmtFinType2 :  5.747471358530822
The Stability Index for  BsmtFullBath :  0.01948428358867021
The Stability Index for  BsmtHalfBath :  0.0030476803573412903
The Stability Index for  BsmtQual :  2.2089222827238193
The Stability Index for  BsmtUnfSF :  0.00992635348082776
The Stability Index for  CentralAir :  0.48817432906301794
The Stability Index for  Condition1 :  8.640895979688906
The Stability Index for  Condition2 :  12.747415809814878
The Stability Index for  Electrical :  11.504811832067588
The Stability Index for  EnclosedPorch :  0.018283232452415286
The Stability Index for  ExterCond :  16.242794802790844
The Stability Index for  ExterQual :  9.106108224770106
The Stability Index for  Exterior1st :  5.698360880459419
The Stability Index for  Exterior2nd :  4.175586494079361
The Stability Index for  Fireplaces :  0.004688364174025415
The Stability Index for  Foundation :  0.7775915824743733
The Stability Index for  FullBath :  0.014307653816356028
The Stability Index for  Functional :  13.609871038041822
The Stability Index for  GarageArea :  0.042537105422203986
The Stability Index for  GarageCond :  6.944761111201195
The Stability Index for  GarageFinish :  1.9784896764330426
The Stability Index for  GarageQual :  7.70346460805724
The Stability Index for  GarageType :  8.632373874527254
The Stability Index for  GarageYrBlt :  0.07490718454298781
The Stability Index for  GrLivArea :  0.021155765014728867
The Stability Index for  HalfBath :  0.007060437470689766
The Stability Index for  Heating :  0.0004415056023075196
The Stability Index for  HeatingQC :  2.7808152198921774
The Stability Index for  HouseStyle :  3.196777792254942
The Stability Index for  KitchenAbvGr :  0.02314690705057906
The Stability Index for  KitchenQual :  2.3694295218022576
The Stability Index for  LandContour :  6.446640743539879
The Stability Index for  LandSlope :  11.232801853580499
The Stability Index for  LotArea :  0.023994823468863485
The Stability Index for  LotConfig :  5.428273089590677
The Stability Index for  LotShape :  2.8298547297403567
The Stability Index for  LowQualFinSF :  0.02556844109038931
The Stability Index for  MSSubClass :  1.516077731759784
The Stability Index for  MSZoning :  6.274363797174393
The Stability Index for  MasVnrArea :  0.0551858812946571
The Stability Index for  MasVnrType :  0.02459961911818724
The Stability Index for  MiscVal :  0.015198882072831273
The Stability Index for  MoSold :  0.012675762033250049
The Stability Index for  Neighborhood :  0.24575009307083714
The Stability Index for  OpenPorchSF :  0.028661061040918062
The Stability Index for  OverallCond :  0.061379799987294655
The Stability Index for  OverallQual :  0.03777989457191217
The Stability Index for  PavedDrive :  6.3563654192865915
The Stability Index for  PoolArea :  0.01412518652462114
The Stability Index for  RoofMatl :  12.659021939768715
The Stability Index for  RoofStyle :  1.1752678311038083
The Stability Index for  SaleCondition :  9.703485338076057
The Stability Index for  SaleType :  10.300915690282656
The Stability Index for  ScreenPorch :  0.06870060110730973
The Stability Index for  Street :  0.035828095107122156
The Stability Index for  TotRmsAbvGrd :  0.051711977207815736
The Stability Index for  TotalBsmtSF :  0.036470243512358815
The Stability Index for  Utilities :  9.209419337938986
The Stability Index for  WoodDeckSF :  0.04692981933019543
The Stability Index for  YearBuilt :  0.018847367925547868
The Stability Index for  YearRemodAdd :  0.057528132951586076
The Stability Index for  YrSold :  0.0037448436190685333
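The report does not state how the Stability Index is computed. One common definition is the Population Stability Index (PSI), which compares the binned distribution of a feature in a reference sample (expected proportions e_i) with a new sample (actual proportions a_i) as the sum over bins of (a_i - e_i) * ln(a_i / e_i). The sketch below assumes that definition; the function name, the quantile binning, the number of bins and the `eps` floor for empty bins are all assumptions, so it may not reproduce the values listed above.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a new sample of one numeric feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Bin edges are quantiles of the reference sample; duplicate edges are merged.
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    # Only interior edges are used, so values outside the reference range
    # fall into the first or last bin instead of being dropped.
    inner = edges[1:-1]
    n_cells = len(inner) + 1

    e_counts = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=n_cells)
    a_counts = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=n_cells)

    # Proportions per bin, floored at eps to avoid log(0) and division by zero.
    e = np.clip(e_counts / e_counts.sum(), eps, None)
    a = np.clip(a_counts / a_counts.sum(), eps, None)
    return float(np.sum((a - e) * np.log(a / e)))


rng = np.random.default_rng(0)
reference = rng.normal(size=1000)
psi_same = population_stability_index(reference, reference)           # identical samples
psi_shifted = population_stability_index(reference, reference + 2.0)  # shifted sample
```

Under this definition, a frequently quoted rule of thumb reads values below 0.1 as stable and values above 0.25 as a notable shift. Ordinal-encoded categorical features would usually be binned by category rather than by quantile, which this sketch does not do.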